Multilevel superposition for deciphering the conformational variability of protein ensembles

Abstract The dynamics and variability of protein conformations are directly linked to their functions. Many comparative studies of X-ray protein structures have been conducted to elucidate the relevant conformational changes, dynamics and heterogeneity. The rapid increase in the number of experimentally determined structures has made comparison an effective tool for investigating protein structures. For example, it is now possible to compare structural ensembles formed by enzyme species, variants or the type of ligands bound to them. In this study, the author developed a multilevel model for estimating two covariance matrices that represent inter- and intra-ensemble variability in the Cartesian coordinate space. Principal component analysis using the two estimated covariance matrices identified the inter-/intra-enzyme variabilities, which seemed to be important for the enzyme functions, with the illustrative examples of cytochrome P450 family 2 enzymes and class A \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} $\beta$\end{document}-lactamases. In P450, in which each enzyme has its own active site of a distinct size, an active-site motion shared universally between the enzymes was captured as the first principal mode of the intra-enzyme covariance matrix. In this case, the method was useful for understanding the conformational variability after adjusting for the differences between enzyme sizes. The developed method is advantageous in small ensemble-size problems and hence promising for use in comparative studies on experimentally determined structures where ensemble sizes are smaller than those generated, for example, by molecular dynamics simulations.


INTRODUCTION
Based on the structures deposited in the Protein Data Bank (PDB), more than 200 000 protein structures have been determined experimentally, and their number is further increasing.In this context, it is important to develop an effective method for drawing connections between functions and structures.
A protein's function directly depends on its static structures as well as on its conformational dynamics and variability [1][2][3][4][5].The covariance matrix of atomic coordinates is a traditional representation of variability.Because of its high dimensionality, it is difficult to accurately determine the off-diagonal components.Fortunately, most protein motions are considered to occur in a subspace spanned by only a small number of eigenvectors of the covariance matrix, which is referred to as essential dynamics [6,7].In this case, principal component analysis (PCA) has been used extensively to understand the dynamic nature of protein conformations, primarily in the analysis of trajectories generated by molecular dynamics (MD) simulations [6,8,9].PCA is also applicable to ensembles composed of a small number of X-ray structures [10][11][12][13][14][15], owing to its dimensional reduction feature.Conformational variability and dynamics have been elucidated from the sample covariance matrices of X-ray structures using PCA.
In some situations, a certain ensemble hierarchy can be observed in the structural data.For example, the X-ray structure of a protein may display different conformations depending on the type of ligand bound to it.In this case, trivial grouping is used to form ensembles based on the ligand type.The variability between ensembles should indicate how the ligands affect the protein's conformational heterogeneity.Other factors that affect conformational differences, such as crystallization conditions or sequence variants, are expected to be balanced between the ensembles; hence, the ligand effect can be efficiently estimated.Thus, introducing a two-level hierarchy by partitioning the data into ensembles by a factor is one way to distinguish factor-specific variability from variations arising from other factors.Similar ideas have been proposed in reports on simulation data [16,17].
The factor-specific and remaining variabilities are referred to as the inter-and intra-ensemble variabilities, respectively.We can consider the latter to refer to factor-adjusted variability rather than variability arising entirely from noise.This is an important aspect of the multilevel approach.For example, when an active-site motion essential for function is shared between family enzymes that differ in the sizes of their active sites, the structural variability of the active site should be observed as a mixture of those arising from the motion and difference in sizes.By removing the inter-ensemble variability caused by the variation in size between ensembles (enzymes), the remaining variability could be the factor (size)-adjusted variability originating from the universal motion of the specific enzyme family.
Factor-based and factor-adjusting approaches to variability can be realized using a random effects model (REM) [18], which is a statistical multilevel model or mixed-effects model without regression coefficients [19] and is composed of a mean and two random variables that represent inter-and intra-ensemble deviations from the mean.
The purpose of this study was to develop a method based on the REM to estimate the covariance matrices in Cartesian coordinate space and demonstrate that the estimates are sufficiently accurate for PCA studies on currently available X-ray structures.First, the method was established by taking additional account of heterogeneous and anisotropic properties of variances.Next, using machine-synthesized data, the accuracy of the estimated covariance matrices was compared with those of alternative methods, particularly for dependencies on the ensemble size and number.Subsequently, the method was applied to the two practical datasets of X-ray structures.One is the application to the enzymes of cytochrome P450 family 2 (CYP2), which is an illustrative example of the factor-adjusting approach, and the other is an example of enzymes of class A β-lactamases (ABLs).The usefulness of this method was demonstrated using these examples.Similar utility can be expected to be found in further comparative studies on protein structures.

Multilevel superposition and variance-covariance estimation
Multilevel superposition is a structural superposition built on an REM.More details on this theory are provided in Supplementary Section S1.

Statistical models and assumptions
Suppose that there are ν ensembles and that ensemble i is composed of m i structures, X ij j=1,••• ,m i .An n×3 matrix X ij contains the Cartesian coordinates of n atoms in three-dimensional space.Let Y ij be the coordinates obtained by translating and rotating X ij , where ⊗ and T denote the Kronecker product and transpose, respectively, t ij is the 3×1 vector, R ij is the 3×3 rotation matrix and 1 n represents the size n vector of ones.In this study, the following hierarchical model was assumed for Y ij : where M is the population mean conformation and Z i and E ij are random deviations and all are n × 3 matrices.This model is also referred to as a multilevel or mixed model.Using this model, we can distinguish between inter-and intra-ensemble variabilities.Z i represents the deviations of the conformations in the i-th ensemble from M and is assumed to be independent and identically distributed as follows: where W is the 3n × 3n covariance matrix of anisotropic interensemble variability.The operator vec represents vectorization, and for an n × 3 matrix A, the corresponding vector a was defined as a = vec A T in this study.In contrast, E ij is a deviation unique to the conformation Y ij .All E ij s were assumed to be independent and identically distributed as follows: The covariance matrix Σ represents anisotropic intraensemble variability.In mixed models for clustered data ( [19], pp.2-4), E ij in Equation ( 2) is usually considered as an error term that is not accounted for by the ensemble deviation Z i .When the estimation of the error variance is not the main objective, the assumption of identical distribution is usually introduced and an effect on the sample size is expected, that is, the effective sample size increases when compared to the case where the individual covariance matrix Σ i is used for each ensemble.In addition, in this study, the assumption also presents another benefit; the matrix Σ contains the information on the conformational variability that all the ensembles have in common, owing to the assumption of identical distribution.
Notice the heterogeneity of Σ, which is the heterogeneity among the variances concerning the atoms, i.e. the diagonal components of Σ.In this paper, this is referred to as heteroscedasticity and is important in the structural superposition, as will be shown later.

Method of REM
The multilevel model presented in Equation ( 2) is a special case of mixed models, that is, the REM.In this study, a method was developed for estimating covariance matrices under the REM, and this method is referred to as the REM.In this method, the distributions of whole ensembles are considered in the superposition of individual ensembles.This procedure is based on the framework of Laird and Ware [18], in which the expectation-maximization (EM) algorithm for the maximum likelihood approach is used.The EM algorithm was used to maximize the likelihood function given as Supplementary Eq. (S7), with respect to the population parameters M, W and Σ.The procedure is guaranteed to converge to a stationary point, i.e. the maximum likelihood estimators (Supplementary Section S1.2).One remarkable aspect of the EM algorithm is its use of unobserved latent variables, such as Z i in this study, as if they were observed.Consequently, in this study, the tentative estimate for the ensemble mean structure M + Z i was used as a target for superposition at each step of the EM algorithm, enabling the application of the REM to the superposition problem.Moreover, the estimation of Z i followed an empirical Bayes approach, where the tentative estimates of W and Σ were integrated into a prior distribution.The incorporation of this approach is anticipated to improve the reliability of the mean ensemble structure M + Z i , particularly for small ensembles (with low m i ).

Iterative weighted least squares
In this study, for a single ensemble, the iterative weighted least squares (IWLS) [20,21] method modified for superposition was used to estimate the covariance matrix.Similar methods were reported by Theobald and Wuttke [22].In these methods, by taking care of the variance heterogeneity among atoms, i.e. heteroscedasticity, less f luctuating regions are tightly superimposed leaving the terminals, for example, f luctuating significantly.Under the assumption of heteroscedastic errors among atoms, as compared to ordinary least squares (OLS), which assumes homoscedastic errors, superposition is expected to be more successfully achieved using these methods in the sense that the estimators concerning superposition are statistically efficient.Although the IWLS in this study does not necessarily converge, convergence is usual [23].A rough estimate of its computational cost was expected to be proportional to the ensemble size.

Two-stage method
For multi-ensemble data, to estimate the covariance matrices W and Σ, the IWLS method can be used in two stages (TSs): first for each ensemble, then for the group of ensemble mean structures.Specifically, the ensemble mean structure M+Z i of each ensemble i is first estimated using IWLS.Next, the M + Z i estimates of all ensembles are superimposed to obtain M and W. Σ is estimated as a pooled variance using the covariance matrices estimated at the first stage (Supplementary Section S1.3).This TS strategy is referred to as TS or TS/IWLS.This is another approach for the multilevel model in Equation (2), in which Z i are considered as population parameters with fixed values rather than random variables (that is, the fixed effects model).

Single pooled ensemble method
As a reference for the comparison of multilevel methods, multiensemble data were merged into a single pooled ensemble (SPE), on which the IWLS estimation was performed assuming The population parameters M and Δ are estimated.This conventional method is referred to as the SPE method or SPE/IWLS.Note the relation Δ ≈ W + Σ.

Numerical tests on synthetic data
Numerical tests were carried out on machine-synthesized data to investigate the effects of the number of ensemble ν and ensemble sizes m i on the accuracy of REM and TS methods.The generated conformations were those of a nucleotide pool-sanitizing enzyme, hMTH1, with n = 156 Cα atoms.Details of the numerical tests are described in Supplementary Section S2.Brief ly, the mean conformation M+Z i of the i-th ensemble was randomly generated from a multivariate normal distribution with mean vec M T and variance-covariance W, which was then used, along with the variance-covariance Σ, to generate conformations belonging to that ensemble.Trajectory data obtained from a series of MD simulations on hMTH1 were used to obtain the TS estimates of M, W and Σ, which were used as true matrices for the numerical tests.Accuracy of the estimated variance v was assessed in terms of the mean absolute error (MAE) over 20 replicates.Accuracy of the eigenvectors was assessed based on their overlap with the corresponding true vectors.
All PDB files listed in the structures section of the UniProt pages were downloaded.The amino acid sequences were aligned using MAFFT [24].Conformations (PDB chains) lacking many residues were omitted.Residues were excluded from the analysis if many conformations were missing.Consequently, a dataset composed of complete cases of 163 conformations of 407 Cα atoms was obtained.Among them, a severe outlier (PDB code 5w0c) was excluded from the final dataset.The PDB and chain IDs of the 162 conformations are listed in Supplementary Table S1.
Each ensemble was constituted by the structures of an identical enzyme, and covariance matrices W and Σ were estimated using REM and were subjected to the eigen-decomposition using the function eigen in R. For comparison, using SPE/IWLS, the covariance matrix Δ for the pooled ensemble members was estimated and similarly factorized.
All PDB files identified by the UniProt IDs on the PDB site were downloaded, except for CTX-M-14, KPC-2, SHV-1 and TEM-1, for which the top 25 matched files were downloaded.Although the majority of files adopted Ambler numbering [25], certain inconsistencies were found for some enzymes.Thus, the sequences were first aligned and renumbered for each enzyme, after which the sequences of all enzymes were aligned using MAFFT.Incomplete chains and residues were omitted.Finally, a dataset composed of complete cases of 149 conformations of 255 Cα atoms was obtained.The PDB and chain IDs of the conformations are listed in Supplementary Table S2.Each ensemble was constituted by the structures of an identical enzyme, and covariance estimation and eigen decomposition were performed as described for CYP2.

Miscellaneous software
The REM and IWLS were coded in C and compiled using GCC 12.1.0.TS was implemented as a bash script in which IWLS and R 4.2.1 [26] were used.R was also used to evaluate accuracy.Molecular drawings were prepared using VMD 1.9.3 [27].

Estimates of variances
The results of the accuracy of the diagonal elements of anisotropic covariance matrices, that is variances, are shown in Figure 1.
In Figure 1, the MAEs of the square root of variances in Ŵ and Σ are plotted as functions of the ensemble size m i .Each value represents the average of over 20 replicates and over 156×3 = 468 diagonal components.
The value of MAE decreased with increasing m i and with increasing ν for every method.The MAEs of the TS/OLS estimates were generally higher than those of the other estimates.The MAEs of the REM estimates of the inter-ensemble standard deviations derived from Ŵ were lower than those of TS in the region where m i was low, and the differences increased with the increase in ν.The MAEs of the estimates of the intra-ensemble standard deviations derived from Σ were low compared with those with inter-ensemble variability.The larger effective sample size for estimating Σ ( ν i m i ) suppressed the magnitude of error as compared with W (ν). The MAEs of the TS/OLS estimates were higher than those of the other estimates, as in the case of W. The other two methods (REM and TS/IWLS) exhibited similar MAE values.

Estimates of eigenvalues and eigenvectors
The results for the eigenvalues are presented in Supplementary Section S3.In short, REM estimates were more accurate than those of TS/IWLS and TS/OLS, in particular when m i and ν were low, although the differences were small compared with the results for the variance estimation.
The accuracy of the eigenvector estimates was evaluated in terms of the mean overlap between the space spanned by the estimates and those of the true vectors.Figure 2 shows the root mean squared inner product (RMSIP) values [28] for the top two dominant eigenvectors of Ŵ and Σ as functions of m i .The RMSIP values were generally high and acceptable as concerning the top two eigenvectors.The RMSIP values for Ŵ were generally lower than those for Σ.An increase in m i effectively improved the accuracy with respect to Σ.
In general, on both W and Σ, there were no large differences between the accuracy of REM and TS/IWLS, except at a low m i where the accuracy of REM was slightly higher than that of TS/I-WLS.This difference disappeared in the results of the top eight eigenvectors (Supplementary Figure S2).The RMSIP values for the TS/OLS estimates were generally lower than those of the other two estimates at every m i and ν.Similar results were obtained for the top eight eigenvectors.

Stability of estimating computation
In the numerical tests of TS/IWLS with m i = 3, the code used in this study did not properly converge to finite values at a rate of once every 14 IWLS estimations.There were no such cases concerning the other methods or other values of m i .

CYP2 family enzymes
The first example is the application to CYP2 family enzymes, which play a major role in drug metabolism, along with other families such as CYP1 and CYP3.Six major human enzymes from the CYP2 family were selected: 2A6, 2B6, 2C8, 2C9, 2D6 and 2E1.
CYP enzymes have F-and G-helices and a loop that connects them.The open/closed motion and f lexibility of this F-G region are believed to be related to the uptake and release of their ligands [29,30].The motion of the F-G region significantly affects the cavity size of the active site [31].In addition, the size of the active site varies between the member enzymes [32].
Figure 3 shows the relationship between the openness of the structures and their coordinates along the principal component 1 (PC1) of the estimated covariance matrices.The openness of each structure was inferred from descriptions in literature.The openness score of 5 indicated that an open conformation is highly likely (PDB codes:2f9q and 3tbg).The score of 4 was assigned when the structure is likely in an open conformation, deduced from descriptions such as substantially more open (3ua5) [33], relatively open (1r9o) [34], expanded through the opening (1pq2) [35] and more open than (5xxi) [36].The score of 3 was assigned to structures that seemed to be in intermediate states between open and closed.More closed structures were defined as those having a score of 2, and structures were defined as having a score of 1 when clearly stated as closed conformations in the references.No scores were attached to the structures without any description of open or closed status in the literature, and these structures were classified as non-determined (ND).For reference, open structures were labeled with their PDB codes in the PC1-PC2 plot of each estimated covariance matrix (Supplementary Figure S3).
The results for Ŵ and Δ were very similar except for the sign on PC1.On the other hand, the PC1 component in Σ seemed more directly related to the open/closed status than the other two.In the plot for Σ (Figure 3C), the open structures were located in the region where PC1 was negative, whereas the majority of closed structures indicated a positive PC1.Thus, the PC1 of Σ appeared to  To confirm this observation, the cavity dimensions were compared (Table 1).The average cavity sizes were in the order 2D6 > 2C8 > 2C9 > 2E1 > 2A6 ≈ 2B6, which was consistent with the order observed for the values of PC1 in Figure 3A and B.
I then investigated what the PC1 coordinate corresponded to in the Cartesian coordinate space.The coordinates in PC1 of Δ and Σ were projected onto the Cartesian coordinate space and then added to the estimated average structures M. Figure 4A and C shows the structures corresponding to the conformations with the maximum and minimum PC1 values for Δ, respectively.Similarly, Figure 4B and D shows the respective structures for Σ.The three rods in each drawing show the dimensions of the cavity, a Dist1 is the distance between two Cα atoms of residues 50 and 213.Dist2 is the distance between the centroid of Cα atoms of residues 103-106 and 108 and that of 232-235.Dist 3 is the distance between the centroid of Cα atoms of residues 103-106 and 108 and that of 476-478.Dist2 and dist3 were defined in accordance with Ref. [36 ].
as defined by dist1, dist2 and dist3 ( Table 1).In transitions (C) to (A) and (D) to (B), the F-G region (helices F and G with a In summary, the results suggested that the REM estimated covariance matrix Σ contained the variability arising from the open/close motion of the CYP2 family after being adjusted for the differences in the native cavity sizes of the member enzymes.

ABLs
The second example is the application of ABLs.ABLs are activesite serine enzymes, similar to class C and D enzymes, and have been intensively studied [37].The prevalence of carbapenemases and extended spectrum beta lactamases (ESBLs), many of which are members of class A (ABLs), is a great concern to public health.ABLs share a similar overall structural architecture surrounding the active site [38,39].The mobility of their loops is reportedly related to ABL function [40].One of the important loops is the Ω-loop, which is a 'hot spot' for substitutions that extend the substrate spectrum of ABLs [41,42].
Figure 6A shows the projections of the structures onto a plane spanned by the top two eigenvectors (PC1 and PC2) of the REMestimated covariance matrix Ŵ.On the PC1 coordinates, carbapenemases were on the positive extreme, whereas non-ESBLs were on the negative extreme.Half of the ESBLs were located near the carbapenemases, and the other half were in the negative region approximately halfway from the origin to the non-ESBLs.
From the perspective of per-residue deviation, PC1 was comprised of peaks at several loops (Figure 8B, left), among which the highest peak in the residues 266-275 corresponded to the heterogeneity in the β9-α12 region.As shown in Figure 7, this portion of the carbapenemases and half of the ESBLs covered over the edge of loop β7-β8 (238-243), whereas the other half of the ESBLs and non-ESBLs pulled the portion to align the edges of the β9-α12 and β7-β8 loops and the Ω loop C-terminal (169-176).Knoverek et al. reported that the penicillinase activity of TEM increases as the opening of the cryptic pocket forms between β7-β8 and Ω loops [44].Interactions between the two loops and the β9-α12 loop may be related to the activity of ABLs.It is also worth noting that the change in the position of the β9-α12 loop was   [39].(B) The outliers of SHV-1 and TEM-1 were neighbors of each other's major clusters.The structures 1pzpA and 1pzoA were reported as core-disrupted structures [43].All structures of the SHV-1-BLIP complex were located in the vicinity of the TEM-1 cluster.The other three chains (A-C) of 5eph were not included in the analysis because of the insufficient coordinates of Cα atoms.coupled to the changes in the loops on the opposite side, i.e. β8-β9 (252-258) and β1-β2 (51)(52)(53)(54)(55).The conformations or motions of these portions may also be related to the activity of ABLs.
The collective motion or joint heterogeneity on the configuration of the three loops was more clearly captured in the PC2 of REM Σ.The plot for the per-residue deviation of PC2 for REM Σ is composed of three peaks at the Ω, β7-β8 and β9-α12 loops (Figure 8C, middle).As shown in Figure 6B, the PC2 values were divided into two groups.These results suggested that there are two main states for the configuration or conformation of these three loops.
Variability in the hinge conformation was detected using REM, as indicated in Figure 8C (left).Specifically, the per-residue deviation of PC1 for REM Σ showed a clear peak in the hinge-α11 region (218-223).A study using Markov state models showed that the hinge regions of TEM-1 and SHV-1 were more f lexible than the highly stable hinge region of KPC-2 [45].Other studies have reported that binding to a beta-lactamase inhibitory protein (BLIP) introduces a conformational change in the hinge region [46,47].These results are consistent with those shown in Figure 6B.The TEM-1 showed two distinct PC1 values depending on the ligand.The SHV1-BLIP complexes were located near the TEM-1 cluster, far from the major cluster of SHV-1.
In summary, the conformational heterogeneity captured in PC1 and PC2 of Ŵ appears to be related to the resistance profiles of the ABLs.With regard to Σ, PC1 indicated conformational heterogeneity in the hinge region, whereas PC2 indicated joint heterogeneity at the three loops of Ω, β7-β8 and β9-α12.

Concluding remarks
In this study, a two-level estimation of covariance matrices in Cartesian coordinates was proposed for comparative studies of protein conformations.The method was based on the REM where variation between the ensembles of conformations (W) was distinguished from that within the ensembles (Σ).
One of the benefits of this method is that the W-adjusted variability is extracted.In the CYP2 example, active-site dynamics shared between the CYP2 enzymes were obtained as the PC1 of Σ being adjusted with the differences in the cavity sizes that were, in turn, obtained as the PC1 of Ŵ.In the example of ABLs, withinensemble variability (Σ) captured the joint heterogeneity of the three adjacent loops as well as the variation in the hinge region.It would be difficult to notice conformational heterogeneity in the hinge region if the traditional non-hierarchical SPE method was employed.Nevertheless, the REM-estimated variability between the ensembles differed little from that estimated by SPE in the case of the ABL structures, as indicated by the high degree of coincidence of the two PC1s indicated.
Notably, in the REM method, the covariance matrices are estimated according to the specified grouping.If the grouping was inadequate, it would be possible to identify artifacts as inter-group variability.Thus, it is important to carefully validate the estimates; for example, by referring to those of the SPE method in which no grouping is used.If a definite grouping is unknown, any clustering method can be used to determine the grouping.For example, in additional experiments, inter-ensemble variability was well estimated based on the grouping determined by k-medoids using root-mean-squared-deviation values.However, the estimates for intra-ensemble variability varied significantly depending on the group sizes, making them less reliable when grouping was determined by clustering methods.
In general, TS/IWLS, the other two-level method used in this study, showed results similar to those of REM.Good agreement was observed between the essential PCs estimated by REM and those estimated by TS/IWLS in the X-ray structure results, except in a few cases.In general, REMs are more efficient (accurate) than fixed effects models (e.g.TS) in the cases where the ensemble sizes are small in comparison with the number of ensembles, m i < ν [19].In addition, the REM method in this study copes with deviations Z i in the empirical Bayes framework, which also implies its advantage in such cases.In other words, information from the other ensembles was used to compensate for the poor sample size m i when estimating Z i and superimposing the structures of ensemble i.
One drawback of REM over TS is its high computational cost.For example, the execution times of REM for CYP2 and ABL were 1117 and 240 s, respectively, using a single core of Intel(R) Core(TM) i9-9900, whereas the respective times of TS/IWLS for CYP2 and ABL were 63.2 and 27.1 s.This advantage of TS over REM is assumed to come from its divide-and-conquer nature and would be further extended when applied to data with very large ensemble sizes (m i ), which is usual in the case of MD trajectory data.The slow speed of REM was partly related to the slow convergence of the EM algorithm.The use of Newton's or similar methods might improve this drawback to some extent.Such improvement would be more valuable in higher-level models than in the two-level models.In theory, the two-level models in this study can be easily extended to higher-level models; however, it would be computationally expensive.
The assumption of normality is not always valid and there may be cases where the methods in this work pose problems.Of particular concern are small samples with outliers.In fact, in the numerical tests of this study, TS/IWLS occasionally failed at the first stage when the ensemble size was small (m i = 3).However, the normal distribution may be viewed as an approximation to the original distribution and is usually used in mixed/random effects models ( [19], p. 233).In addition, the effective sample sizes are enlarged in the REM and SPE methods, because the estimations are carried out using structures of all ensembles.Nevertheless, it would be important to validate results, for example, by visually inspecting the superimposed ensemble-average structures and the projections of the structures onto the planes of the estimated PCs.
This study developed methods that employ superposition in favor of intuitively appealing Cartesian coordinates, rather than internal coordinate systems.Recently, sophisticated methods for analyzing structure ensembles in the Cartesian coordinates space were reported.Hirsch and Habeck applied a Gaussian mixture model (GMM) to model the intra-ensemble variability within a single ensemble (or within a single protein chain) [48].Although their superposition method corresponded to OLS, the heteroscedasticity was accounted for using GMM.Jamroz et al. [49] proposed a score named FlexScore, which is based on the estimated mean structure and covariance matrix of a single ensemble.They used the framework of Theobald and Wuttke [22] for ensemble superposition.Lindorff-Larsen and Ferkinghoff-Borg [50,51] proposed a similarity measure between two ensembles; that is, the Kullback-Leibler divergence between the two normal distributions, which were represented by their respective mean structures and covariance matrices.These studies concerned the distribution of a single ensemble, whereas REM and TS in this paper are methods for multiple ensembles.
A possible limitation of these methods is that they arise from superpositions.There is controversy over superposition in the context of PCA and the clustering of MD trajectories, particularly when applied to extreme structural changes and perhaps to too large ensembles [52,53].Nevertheless, two-level superposition alleviates this drawback by dividing the entire dataset into ensembles, which also mitigates the concern that the average structure used for superposition is not a real protein conformation.In addition, the heteroscedasticity-aware strategy of REM and IWLS is expected to improve the quality of superposition over OLS, as indicated by the results of the numerical tests.
In an additional application on calmodulin (Supplementary Section S5), in which a mixture of nuclear magnetic resonance (NMR) and X-ray ensembles was used, REM and IWLS successfully captured large structural changes.Calmodulin consists of two domains, N-and C-domains, that are connected by a f lexible linker.The superposition (Figure 9A), obtained using the REM method, indicates that large conformational changes occurred between the two domains, which are difficult to interpret in the TS/OLS results (Figure 9B).The REM method yielded the superposition fitted solely at the N-domain despite the high structural similarity between N-and C-domains.On the other hand, when more than seven N-terminal residues were excluded from the calculation, C-domain fitted superpositions were obtained as shown in Supplementary Figure S6A.A similar result was also observed in the TS/IWLS method (Supplementary Figure S6B), although its response to the changes in the number of residues was different from that of the REM method.
It should be noted that the final superposition depends on the initial superposition, that is the problem of the multiple local minima.For the calmodulin data, some examples are shown in Supplementary Figure S7 and discussed in Supplementary Section S5 (pp.s23-s24).Thus, it may be meaningful to seek other superpositions or to confirm that the superposition is a global optimum, by initiating from different superpositions or only using subsets of data.
There seem to be several approaches for improving or extending the multilevel superposition reported in this paper.These methods are restricted to datasets of complete cases.It will be valuable to investigate methods of relaxing these restrictions, to allow for the analysis of structures with gaps and/or with disordered atoms.Another concern is that the The yellow and blue lines represent the average structures of the NMR and X-ray ensembles, respectively.For example, the uppermost helix in (A) is the average structure of that of an X-ray ensemble.Calmodulin consists of N-and C-domains and a link connecting them.In the structures superimposed using the (heteroscedastic) REM method, compared to the C-terminal half, the N-terminal half exhibited a better fit.accuracy or reliability of X-ray and NMR structures varies between experiments, unlike structures generated from MD simulations.If a measure that describes the reliability of each experiment was established, it would be used to impose differential importance on individual experiments using, for example, a composite likelihood [ 54], to which a major portion of the current work could be applied.Studies on the utility of the estimated covariance matrices may also yield meaningful results.For example, this information may be used in the refinement or validation of newly obtained structures.It is also interesting to incorporate a kernel approach to cope with non-linear motion and heterogeneity [55].The superpositionfree application of mixed/random effects models to the internal coordinates, e.g.inter-atomic distances and dihedral angles, seems to be straightforward, although application-specific investigations will be required.Lastly, owing to the similarity between the two methods future work incorporating GMM, as a clustering tool, in the REM method would serve as an interesting approach.
In the REM method, one of the benefits of subjecting multiple ensembles rather than a single ensemble is that the entire dataset is used to improve the superposition and estimation in small ensembles of insufficient structures.Computationally, REM is more expensive than TS on large-scale problems.These points together suggest that the REM method is well suited for problems with small-sized ensembles, particularly for experimentally obtained structures.Its results, which are given in Cartesian coordinates, may be appealing to researchers in experimental fields, although the scope of the REM method is not limited to X-ray and NMR structures.It is also interesting to apply the REM and TS methods to clusters of structures in sequence-based prediction [56].As illustrated in this study, the REM method is useful for comparative studies of protein structures, providing a tool for investigating the dynamics and heterogeneity of protein conformations in terms of inter-and intra-ensemble variability.

Key Points
• A mixed-model method was developed for estimating two covariance matrices that represent inter-and intraensemble variabilities.
• The method incorporated multilevel superposition in the Cartesian coordinate space.• The results showed that the conformational variation between ensembles can be effectively extracted, using principal component analysis, from the inter-ensemble covariance matrix estimated by the method proposed in this paper.• The other covariance matrix contained information on the variability spanning the ensembles, which resulted from the removal of inter-ensemble differences from the observed variability.• As an example of cytochrome P450 family 2 enzymes, in which each enzyme has an active site cavity of various sizes, an open/closed motion shared between the enzymes was captured in the first principal mode of the intra-enzyme covariance matrix.

Figure 1 .
Figure 1.The MAE of the estimates of standard deviations.The upper four panes are the plots of the MAE of the square root of the diagonal components of Ŵ for ν = 10 (A), 20 (B), 40 (C) and 80 (D).The lower panes are the respective plots for Σ.The averages for true values for W and Σ were 0.52 and 0.51, respectively.Lower MAE indicates higher accuracy.The accuracy of the TS/OLS estimates was lower than those of the other two methods.The accuracy of the REM Ŵ was slightly higher than the other two at low m i .

Figure 2 .
Figure 2. The RSMIP values of the top two eigenvectors as a function of m i .The upper four panes are the plots for Ŵ with ν = 10 (A), 20 (B), 40 (C) and 80 (D).The lower panes are the respective plots for Σ.At low m i , the accuracy of REM estimates was slightly higher than those of the other methods.

Figure 3 .
Figure 3.The plot of openness and coordinates of the CYP2 structures along the PC1 of the covariance matrices: (A) PSE/IWLS Δ, (B) REM Ŵ and (C) REM Σ. Openness was inferred from descriptions in the literature.ND is the abbreviation for non-determined and indicates that informative descriptions were not found in the literature.The PC1 of REM Σ seems to have a better correlation with openness.Notice the difference in the scale of the vertical axes.

Figure 4 .
Figure 4. Open and closed structures of CYP2 generated in accordance with the PC1 of the estimated covariance matrices.The traces of Cα are colored by the residue index, from red (C-terminal) to blue (N-terminal).The open structures (A) and (B) were obtained by adding the highest coordinates of the PC1 of SPE Δ and REM Σ, respectively, throughout the structures to the estimated mean structures.The closed structures (C) and (D) were obtained by similarly adding the lowest coordinates of the PC1 of SPE Δ and REM Σ, respectively.Structures (A) and (B) show the larger cleft produced by the upward shift of the F-G region.In addition, in structure (A), the upward shift of the B-C loop is remarkable, enlarging the active site cavity as shown by the elongated rod dist3.The three rods are drawn as defined in Table1.

Figure 5 .
Figure 5.The average magnitudes of per-residue deviations of the CYP2 conformations along PC1.(A) SPE Δ, (B) REM Ŵ and (C) REM Σ.Each deviation is the average over the 162 conformations.The residue numbers of 2C8 are shown.The capital letters and bars at the bottom indicate the major helices and their regions (adopted from Ref. 37).

Figure 6 .
Figure 6.Projections of the X-ray structures of the β-lactamases onto the PC1-PC2 plane of the covariance matrices: (A) REM Ŵ and (B) REM Σ. (A) The structures were classified into loosely three clusters: non-ESBLs, half of the ESBLs, and carbapenemases and the other half of the ESBLs.The outliers, 6afmA and 6afnA, of PenL are the structures of a noncatalytic-region mutant C69Y[39].(B) The outliers of SHV-1 and TEM-1 were neighbors of each other's major clusters.The structures 1pzpA and 1pzoA were reported as core-disrupted structures[43].All structures of the SHV-1-BLIP complex were located in the vicinity of the TEM-1 cluster.The other three chains (A-C) of 5eph were not included in the analysis because of the insufficient coordinates of Cα atoms.

Figure 7 .
Figure 7.The superimposed mean structures of the β-lactamases.The cartoon-drawing is the mean structure of TEM-1.The other structures are shown in colored traces.The gray, blue and orange lines are the structures of ESBL, non-ESBL and carbapenemases, respectively.The location of β9-α12 loop was noticeable.The locations of the carbapenemases and the half of the ESBLs were shifted to cover the β7-β8 loop which faced the Ω-loop.

Figure 8 .
Figure 8. Displacement of the Cα atoms of each structure of the β-lactamases.The coordinates in the first three PCs of (A) SPE Δ, (B) REM Ŵ and (C) REM Σ were transformed to the deviations in the Cartesian coordinates (Å) and plotted per residue.The peak occurred at the hinge-α11 region (218-223) for the deviation along the PC1 of Σ (c, left).Ω-, β7-β8, and β9-α12 (169-176, 238-243, 266-275, respectively) constituted triple peaks in the plot for the PC2 of Σ (C, middle).The standalone peak in (C, right) was produced mainly by the outliers in BEL-1 (5ephD).The PC1s of Δ and Ŵ were almost identical as indicated by the dot product between the two as high as 0.961.

Figure 9 .
Figure 9. Superimposed mean group structures of calmodulin.(A) REM and (B) TS/OLS.The traces of the C α atoms are shown.The yellow and blue lines represent the average structures of the NMR and X-ray ensembles, respectively.For example, the uppermost helix in (A) is the average structure of that of an X-ray ensemble.Calmodulin consists of N-and C-domains and a link connecting them.In the structures superimposed using the (heteroscedastic) REM method, compared to the C-terminal half, the N-terminal half exhibited a better fit.

Table 1 :
Average distances showing the cavity size of each enzyme a .The values are the averages over the whole conformations of each enzyme